Differentiable Search Indices (DSIs) encode a corpus of documents in the parameters of a model and use the same model to map queries directly to relevant document identifiers. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12\%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting by a significant margin. Concretely, it improves the average Hits@10 by $+21.1\%$ over competitive baselines for NQ and requires $6$ times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
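The generative-memory rehearsal described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the (pseudo-query, doc-id) tuple format, and the replay ratio are all hypothetical.

```python
import random

def build_replay_batches(new_corpus, pseudo_query_memory,
                         replay_ratio=0.3, batch_size=8, seed=0):
    """Mix (pseudo-query, doc-id) pairs for previously indexed documents
    into the batches used to index a new corpus, so that continual
    indexing also rehearses retrieval for old documents."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    batches = []
    for start in range(0, len(new_corpus), n_new):
        batch = list(new_corpus[start:start + n_new])
        # Supplement with sampled pseudo-queries from the generative memory.
        batch += rng.sample(pseudo_query_memory,
                            min(n_replay, len(pseudo_query_memory)))
        batches.append(batch)
    return batches
```

In the paper the pseudo-queries come from a learned query generator; here the memory is simply a list supplied by the caller.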
Training large, deep neural networks to convergence can be prohibitively expensive. As a result, often only a small selection of popular, dense models are reused across different contexts and tasks. Increasingly, sparsely activated models, which seek to decouple model size from computation costs, are becoming an attractive alternative to dense models. Although more efficient in terms of quality versus computation cost, sparse models remain data-hungry and costly to train from scratch in the large scale regime. In this work, we propose sparse upcycling -- a simple way to reuse sunk training costs by initializing a sparsely activated Mixture-of-Experts model from a dense checkpoint. We show that sparsely upcycled T5 Base, Large, and XL language models and Vision Transformer Base and Large models significantly outperform their dense counterparts on SuperGLUE and ImageNet, respectively, using only ~50% of the initial dense pretraining sunk cost. The upcycled models also outperform sparse models trained from scratch on 100% of the initial dense pretraining computation budget.
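The upcycling initialization can be sketched in a few lines. This is a schematic under stated assumptions: a single feed-forward block with hypothetical weight names, and a zero-initialized router (which makes routing uniform at step 0) as one plausible choice; the paper's actual initialization details may differ.

```python
import numpy as np

def upcycle_ffn(dense_w_in, dense_w_out, num_experts):
    """Sparse upcycling of one feed-forward block: every expert of the new
    Mixture-of-Experts layer starts as an identical copy of the dense FFN
    weights, so the upcycled model initially matches the dense checkpoint."""
    experts = [(dense_w_in.copy(), dense_w_out.copy())
               for _ in range(num_experts)]
    d_model = dense_w_in.shape[0]
    # Zero router logits -> softmax routing is uniform over experts at init.
    router = np.zeros((d_model, num_experts))
    return experts, router
```

Because each expert is a copy (not a view), subsequent training lets the experts diverge from one another while starting from the dense solution.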
Chromosome analysis is essential for diagnosing genetic disorders. For hematologic malignancies, identification of somatic clonal aberrations by karyotype analysis remains the standard of care. However, karyotyping is costly and time-consuming because of the largely manual process and the expertise required in identifying and annotating aberrations. Efforts to date to automate karyotype analysis have fallen short in aberration detection. Using a training set of ~10k patient specimens and ~50k karyograms from over 5 years from the Fred Hutchinson Cancer Center, we created a labeled set of images representing individual chromosomes. These individual chromosomes were used to train and assess deep learning models for classifying the 24 human chromosomes and identifying chromosomal aberrations. The top-accuracy models utilized the recently introduced Topological Vision Transformers (TopViTs) with 2-level-block-Toeplitz masking, to incorporate structural inductive bias. TopViT outperformed CNN (Inception) models with >99.3% accuracy for chromosome identification, and exhibited accuracies >99% for aberration detection in most aberrations. Notably, we were able to show high-quality performance even in "few-shot" learning scenarios. Incorporating the definition of clonality substantially improved both precision and recall (sensitivity). When applied to "zero-shot" scenarios, the model captured aberrations without training, with perfect precision at >50% recall. Together these results show that modern deep learning models can approach expert-level performance for chromosome aberration detection. To our knowledge, this is the first study demonstrating the downstream effectiveness of TopViTs. These results open up exciting opportunities for not only expediting patient results but also providing a scalable technology for early screening of low-abundance chromosomal lesions.
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper, we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
A common approach to avoiding overfitting in supervised learning is early stopping, where a held-out set is used for iterative evaluation during training to find the number of training steps that maximizes generalization. However, such a method requires a disjoint validation set, so a portion of the labeled data from the training set is usually left out for this purpose, which is not ideal when training data is scarce. Furthermore, when the training labels are noisy, the model's performance on the validation set may not be an accurate proxy for generalization. In this paper, we propose a method to spot the early stopping point across training iterations without the need for a validation set. We first show that, in the overparameterized regime, the randomly initialized weights of linear models converge to the same direction during training. Using this result, we propose to train two parallel instances of a linear model, initialized with different random seeds, and use their intersection as a signal to detect overfitting. To detect the intersection, we use the cosine distance between the weights of the parallel models across training iterations. Noting that the last layer of a NN is a linear map of the pre-last-layer activations to the output logits, we build on our criterion for linear models using the novel notion of counterfactual weights, and propose an extension to multi-layer networks. We conduct experiments on two domains where early stopping has a pronounced effect on preventing overfitting of NNs: (i) learning from noisy labels; and (ii) learning to rank in IR. Our experiments on four widely used datasets confirm the effectiveness of our method for generalization. For a wide range of learning rates, our method, called the Cosine-Distance Criterion (CDC), leads to better generalization on average than all methods we compare against, in almost all of the tested cases.
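A minimal sketch of the cosine-distance idea between two parallel models. The distance computation follows the abstract; the patience-based stopping rule is an assumption standing in for the paper's exact intersection test, and all names here are hypothetical.

```python
import numpy as np

def cosine_distance(w1, w2):
    """Cosine distance between the flattened weight vectors of two models."""
    w1, w2 = np.ravel(w1), np.ravel(w2)
    return 1.0 - float(w1 @ w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

def cdc_stop_step(distances, patience=3):
    """Pick a stopping step from the per-iteration cosine-distance curve:
    stop once the distance has failed to improve on its running minimum for
    `patience` consecutive steps, i.e. once the two parallel models have
    stopped converging toward each other."""
    best, best_step, stale = float("inf"), 0, 0
    for t, d in enumerate(distances):
        if d < best:
            best, best_step, stale = d, t, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_step
```

In use, `distances[t]` would be `cosine_distance` evaluated on the two seeds' weights at training iteration `t`.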
There has been a lot of interest in the scaling properties of Transformer models. However, not much has been done on the front of investigating the effect of different inductive biases and model architectures on scaling properties. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures, such as Transformers, Switch Transformers, Universal Transformers, Dynamic Convolutions, Performers, and the recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling, and (2) the best performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.
Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, potentially leading to slow and costly use at inference time. In practice, however, the generations made by LLMs are composed of varying levels of difficulty. While certain predictions truly benefit from the models' full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and per generation timestep. Early-exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local, per-token exit decisions; and (3) attending back to hidden representations that are missing due to early exits at previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute -- with a potential speedup of up to $\times 3$ -- while maintaining high performance.
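A toy sketch of confidence-based early exiting for a single decoding step. The softmax max-probability is used as the confidence measure, which is only one of several measures the abstract alludes to; the function names and the per-layer-logits interface are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # shift for numerical stability
    return z / z.sum()

def early_exit_layer(per_layer_logits, threshold=0.9):
    """Walk up the decoder stack and stop at the first layer whose
    intermediate prediction clears the confidence threshold.
    Returns (exit_layer_index, predicted_token_id)."""
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(np.asarray(logits, dtype=float))
        if probs.max() >= threshold:
            return layer, int(probs.argmax())
    # No layer was confident enough: fall back to the full model.
    return len(per_layer_logits) - 1, int(probs.argmax())
```

Compute savings come from skipping all layers above the exit point for easy tokens, while hard tokens still use the full stack.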
Transfer learning is the predominant paradigm for training deep networks on small target datasets. Models are typically pretrained for classification on large "upstream" datasets, as such labels are easy to collect, and then finetuned on "downstream" tasks such as action localisation, which are smaller due to their finer-grained annotations. In this paper, we question this approach, and propose co-finetuning -- simultaneously training a single model on multiple "upstream" and "downstream" tasks. We demonstrate that co-finetuning outperforms traditional transfer learning when using the same total amount of data, and also show how we can easily extend the approach to multiple "upstream" datasets to further improve performance. In particular, co-finetuning significantly improves performance on rare classes in our downstream task, as it has a regularising effect and enables the network to learn feature representations that transfer between different datasets. Finally, we observe how co-finetuning with public video-classification datasets allows us to achieve state-of-the-art results for spatio-temporal action localisation on the challenging AVA and AVA-Kinetics datasets, outperforming recent works that develop intricate models.
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yields consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot, text-conditioned and one-shot, image-conditioned object detection. Code and models are available on GitHub.
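A schematic of the zero-shot, text-conditioned scoring step of such a detector: each predicted box's image embedding is compared against a text-query embedding by normalized dot product. The function name, the embedding shapes, and the threshold are illustrative assumptions, not the paper's actual head.

```python
import numpy as np

def zero_shot_detect(box_embeddings, text_embedding, score_threshold=0.5):
    """Score each predicted box by the cosine similarity between its image
    embedding and the text-query embedding; return the indices of boxes
    that clear the threshold."""
    boxes = box_embeddings / np.linalg.norm(box_embeddings, axis=1,
                                            keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    scores = boxes @ text  # cosine similarity per box
    return [i for i, s in enumerate(scores) if s >= score_threshold]
```

Because the query is just an embedding, swapping in the embedding of a cropped example image instead of text gives the one-shot, image-conditioned variant with no change to the scoring code.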
User interface modeling is inherently multimodal, involving several distinct types of data: images, structures, and language. The tasks are also diverse, including object detection, language generation, and grounding. In this paper, we present VUT, a Versatile UI Transformer that takes multimodal input and simultaneously accomplishes 5 distinct tasks with the same model. Our model consists of a multimodal Transformer encoder that jointly encodes UI images and structures, and performs UI object detection when the UI structure is absent from the input. Our model also includes an auto-regressive Transformer that encodes the language input and decodes the output, for both UI question-answering and command grounding. Our experiments show that, for most tasks, jointly training on multiple tasks lets VUT substantially reduce the number of models and the footprint needed, while achieving accuracy on par with, or exceeding, baseline models trained on each task individually.